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Abstract. In this work we describe an efficient implementation of a hierarchy of 
algorithms for the decomposition of dense matrices over the field with two elements 
(F2). Matrix decomposition is an essential building block for solving dense systems 
of linear and non-linear equations and thus much research has been devoted to 
improve the asymptotic complexity of such algorithms. In this work we discuss an 
implementation of both well-known and improved algorithms in the M4RI library. 
The focus of our discussion is on a new variant of the M4RI algorithm - denoted 
MMPF in this work - which allows for considerable performance gains in practice 
when compared to the previously fastest implementation. We provide performance 
figures on x86_64 CPUs to demonstrate the viability of our approach. 



1 Introduction 

We describe an efRcient implementation of a hierarchy of algorithms for PLS decompo- 
sition of dense matrices over the field with two elements (F2). The PLS decomposition 
is closely related to the well-known PLUQ and LQUP decompositions. However, it of- 
fers some advantages in the particular case of F2. Matrix decomposition is an essential 
building block for solving dense systems of linear and non-linear equations (cf. [11, 10]) 
and thus much research has been devoted to improve the asymptotic complexity of such 
algorithms. In particular, it has been shown that various matrix decompositions such as 
PLUQ, LQUP and LPS are essentially equivalent and can be reduced to matrix-matrix 
multiplication (cf. [13]). Thus, we know that these decompositions can be achieved in 
0(n") where w is the exponent of linear algebra^. In this work we focus on matrix de- 
composition in the special case of F2 and discuss an implementation of both well-known 
and improved algorithms in the M4RI library [2]. The M4RI library implements dense 
linear algebra over F2 and is used by the Sage [16] mathematics software and the Poly- 
BoRi [9] package for computing Grobner bases. It is also the linear algebra library used 
in [15,14]. 

Our implementation focuses on 64-bit x86 architectures (x86_64), specifically the Intel 
Core 2 and the AMD Opteron. Thus, we assume in this chapter that each native CPU 
word has 64 bits. However it should be noted that our code also runs on 32-bit CPUs and 

on non-x86 CPUs such as the PowerPC. 

* This author was supported by the Royal Holloway Valerie Myerscough Scholarship. 
^ For practical purposes we set u) = 2.807. 



Element-wise operations over F2 are relatively cheap compared to loads from and 
writes to memory. In fact, in this work we demonstrate that the two fastest implemen- 
tations for dense matrix decomposition over F2 (the one presented in this work and the 
one found in Magma [8] due to Allan Steel) perform worse for sparse matrices despite 
the fact that fewer field operations are performed. This indicates that counting raw field 
operations is not an adequate model for estimating the running time in the case of F2. 

This work is organised as follows. We will start by giving the definitions of reduced row 
echelon forms (RREF), PLUQ and PLS decomposition in Section 2 and establish their 
relations. We will then discuss Gaussian elimination and the M4RI algorithm in Section 3 
followed by a discussion of cubic PLS decomposition and the MMPF algorithm in 4. We 
will then discuss asymptotically fast PLS decomposition in Section 5 and implementation 
issues in Section 6. We conclude by giving empirical evidence of the viability of our 
approach in Section 7. 

2 RREF and PLS 

Proposition 1 (PLUQ decomposition). Any my. n matrix A with rankr, can be writ- 
ten A = PLU Q where P and Q are two permutation matrices, of dimension respectively 
m X m and n x n, L is m x r unit lower triangular and U is r x n upper triangular. 

Proof. See [13]. 

Proposition 2 (PLS decomposition). Anymxn matrix A with rankr, can be written 
A = PLS where P is a permutation matrix of dimension m x m, L is m. x r unit lower 

triangular and S is an r x n matrix which is upper triangular except that its columns are 
permuted, that is S = UQ for U r x n upper triangular and Q is a n x n permutation 
matrix. 

Proof Write A = PLUQ and set S = UQ. 

Another way of looking at PLS decomposition is to consider the A = LQUP de- 
composition [12]. We have A = LQUP = LSP where S = QU. We can also write 
A = LQUP = SUP where S = LQ. Applied to A'^ we then get A = P'^U'^S'^ = P'L'S'. 
Finally, a proof for Proposition 2 can also be obtained by studying any one of the Algo- 
rithms 9, 3 or 4. 

Definition 1 (Row Echelon Form). An m x n matrix A is in echelon form if all 
zero rows are grouped together at the last row positions of the m,atrix, and if the leading 
coefficient of each non zero row is one and is located to the right of the leading coefficient 
of the above row. 

Proposition 3. Any mxn matrix can be transformed into echelon form by matrix mul- 
tiplication. 

Proof See [13] 

Note that while there are many PLUQ decompositions of any matrix A there is always 
also a decomposition for which we have that S = UQ^ is a row echelon form of A. In 



this work we compute A = PLS such that S is in row echelon form. Thus, a proof for 
Proposition 3 can also be obtained by studying any one of the Algorithms 9, 3 or 4. 

Definition 2 (Reduced Row Echelon Form). An m x n matrix A is in reduced 
echelon form if it is in echelon form and each leading coefficient of a non zero row is the 
only non zero element in its column. 

3 Gaussian Elimination and M4RI 

Gaussian elimination is the classical, cubic algorithm for transforming a matrix into (re- 
duced) row echelon form using elementary row operations only. The "Method of the 
Four Russians" Inversion (M4RI) [7] reduces the number of additions required by Gaus- 
sian elimination by a factor of log n by using a caching technique inspired by Kronrod's 
method for matrix-matrix multiplication. 

3.1 The "Method of the Four Russians" Inversion (M4RI) 

The "Method of the Four Russians" inversion was introduced in [5] and later described 
in [6] and [7] . It inherits its name and main idea from the misnamed "Method of the Four 
Russians" multiplication [4, 1] . 

To give the main idea consider for example the matrix A of dimension m x n in 
Figure 3.1. The fc x n (fc = 3) submatrix on the top has full rank and we performed 
Gaussian elimination on it. Now, we need to clear the first k columns of A for the rows 
below k (and above the submatrix in general if we want the reduced row echelon form). 
There are 2*^ possible linear combinations of the first k rows, which we store in a table T. 
We index T by the first k bits (e.g., Oil ^ 3). Now to clear k columns of row i we use the 
first k bits of that row as an index in T and add the matching row of T to row i, causing 
a cancellation of k entries. Instead of up to k additions this only costs one addition due to 
the pre-computation. Using Gray codes (or similar techniques) this pre- computation can 
be performed in 2*^ vector additions and the overall cost is 2'^ + m — k + k'^ vector additions 
in the worst case (where k'^ accounts for the Gauss elimination of the k x n submatrix). 
The naive approach would cost k ■ m row additions in the worst case to clear k columns. 
If we set k = log m then the complexity of clearing k columns is O (m + log^ m) vector 
additions in contrast to 0{m ■ logm) vector additions using the naive approach. 

This idea leads to Algorithm 1. In this algorithm the subroutine GaussSubmatrix 
(cf. Algorithm 8) performs Gauss elimination on a A: x n submatrix of A starting at 
position (r, c) and searches for pivot rows up to m. If it cannot find a submatrix of rank 
k it will terminate and return the rank k found so far. Note the technicality that the 
routine GaussSubmatrix and its interaction with Algorithm 1 make use of the fact that 
all the entries in a column below a pivot are zero if they were considered already. 

The subroutine MakeTable (cf. Algorithm 7) constructs the table T of all 2*^ linear 
combinations of the k rows starting a row r and a column c, i.e. it enumerates all elements 
of the vector space span(r, r -|- -|- 1) spanned by the rows r, . . . ,r + k — 1. Finally, the 
subroutine AddRowsFromTable (cf. Algorithm 6) adds the appropriate row from T - 
indexed by k bits starting at column c ~ to each row of A with index i ^ {r, . . . ,r + k—l}. 
That is, it adds the appropriate linear combination of the rows {r, . . . ,r + k — 1} onto a 
row i in order to clear k columns. 
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Fig. 1. M4RI Idea 



Note that the relation between the index id and the row j in T is static and known 
a priori because GaussSubmatrix puts the submatrix in reduced row echelon form. In 
particular this means that the k x k submatrix starting at (r, c) is the identity matrix. 



Input: A ~ a, m X n matrix 

Input: A: - an integer fe > 

Result: A is in reduced row echelon form. 

begin 

r,ci — 0,0; 
while c < n do 

if c + k > n then fc n — c; 

k i GaUSSSuBMATRIx(A, r, c, k, m); 

if fc > then 

T, L < — MakeTable(A, r, c, k); 
AddRowsFromTable(A, 0, r, c, fc, T, L); 
AddRowsFromTable(A, r + fe, m, c, k, T, L); 
end 

r, c i — r + k,c + k; 
it k =^ k then c •<— c + 1; 
end 
end 

Algorithm 1: M4RI 



When studying the performance of Algorithm 1, wc expect the function MakeTable 
to contribute most. Instead of performing fc/2- 2*^ — 1 additions MakeTable only performs 

2^ — \ vector additions. However, in practice the fact that k columns are processed in 
each loop iteration of AddRowsFromTable contributes signficiantly due to the better 
cache locality. Assume the input matrix A does not fit into L2 cache. Gaussian elimination 
would load a row from memory, clear one column and likely evict that row from cache in 
order to make room for the next few rows before considering it again for the next column. 
In the M4RI algorithm more columns are cleared per load. 



We note that our presentation of M4RI differs somewhat from that in [6]. The key 
difference is that our variant does not throw an error if it cannot find a pivot within 
the first 3k rows in GaussSubmatrix. Instead, our variant searches all rows and conse- 
quently the worst-case complexity is cubic. However, on average for random matrices we 
expect to find a pivot within 3k rows and thus expect the average-case complexity to be 
0{n^/logn). 

4 M4RI and PLS Decomposition 

In order to recover the PLS decomposition of some matrix A, we can adapt Gaussian 
elimination to preserve the transformation matrix in the lower triangular part of the 
input matrix A and to record all permutations performed. This leads to Algorithm 9 in 
the Appendix which modifies A such that it contains L in below the main diagonal, S 
above the main diagonal and returns P and Q such that PLS = A and SQ^ = U. 
The main differences between Gaussian elimination and Algorithm 9 are: 

— No elimination is performed above the currently considered row, i.e. the rows 0, . . . , 1 — 
1 are left unchanged. Instead elimination starts below the pivot, from row r + 1. 

— Column swaps are performed at the end of Algorithm 9 but not in Gaussian elimina- 
tion. This step compresses L such that it is lower triangular. 

— Row additions are performed starting at column r + 1 instead of r to preserve the 
transformation matrix L. Over any other field we would have to rescale A[r, r] for the 
transformation matrix L but over F2 this is not necessary. 

4.1 The Method of Many People Factorisation (MMPF) 

In order to use the M4RI improvement over Gaussian elimination for PLS decomposition, 
we have to adapt the M4RI algorithm. 

Column Swaps Since column swaps only happen at the very end of the algorithm we can 
modify the M4RI algorithm in the obvious way to introduce them. 

U vs. I Recall, that the function GaussSubmatrix generates small kx k identity matri- 
ces. Thus, even if we remove the call to the function AddRowsFromTable(A, 0, r, c, k, T) 
from Algorithm 1 we would still eliminate up to fc — 1 rows above a given pivot and thus 
would fail to produce U. The reason the original specification [5] of the M4RI requires 
kx k identity matrices is to have a a priori knowledge of the relationship between id and 
j in the function AddRowsFromTable. On the other hand the rows of any kxn upper 
triangular matrix also form a basis for the fc-dimensional vector space span(r, . . . , r+fc— 1). 
Thus, we can adapt GaussSubmatrix to compute the upper triangular matrix instead 
of the identity. Then, in MakeTableI we can encode the actual relationship between a 
row j of T and id in the lookup table L. 

Preserving L In Algorithm 9 preserving the transformation matrix L is straight forward: 
addition starts in column c + 1 instead of c. On the other hand, for M4RI we need to fix 
the table T to update the transformation matrix correctly; For example, assume fc = 3 
and that the first row of the kxn submatrix generated by GaussSubmatrix has the 



first k bits equal to [1 1] . Assume further that we want to clear k bits of a a row 

which also starts with [1 1] . Then - in order to generate L we need to encode that 
this row is cleared by adding the first row only, i.e. we want the first fc = 3 bits to be [1 
0] . Recall that in the M4RI algorithm the id for the row j starting with [1 0] is [1 
0] if expressed as a sequence of bits. Thus, to correct the table, we add the k bits of 
the a priori id onto the first k entries in T (starting at c) as in MakeTableI. 

Other Bookkeeping Recall that GaussSubmatrix's interaction with Algorithm 1 uses the 
fact that processed columns of a row are zeroed out to encode whether a row is "done" or 
not. This is not true anymore if we compute the PLS decomposition instead of the upper 
triangular matrix in GaussSl'BMATRIX since we store L below the main diagonal. Thus, 
we explicitly encode up to which row a given colimin is "done" in PlsSubmatrix (cf. 
Algorithm 10). Finally, we have to take care not to include the transformation matrix L 
when constructing T. 



Input; A - & m X n matrix 
Input: Tstart " an integer < rstart < m 
Input: Cstart - an integer < Cgtart < ri 
Input: - an integer A; > 

Result: Retuns an 2*^ x n matrix T and the translation table L 
begin 

T i — the 2* X n zero matrix; 
for 1 < i < 2*^ do 

j < — the row index of A to add according to the Gray code; 
add row j of A to the row i of T starting at Cstart ; 
end 

L i — an integer array with 2* entries; 
for 1 < i < 2* do 

L[id] i — i; 
end 

for 1 < z < 2*= do 

6o, . . . , b^_i < — bits of a priori id of the row i; 
for < j < k do 

I T[i, Cstart + j] < — T[i, Cstart + j] + bj ; 
end 
end 

return T, L; 
end 

Algorithm 2: MakeTableI 



These modifications lead to Algorithm 3 which computes the PLS decomposition of 
A in-place, that is L is stored below the main diagonal and S is stored above the main 
diagonal of the input matrix. Since none of the changes to the M4RI algorithm affect 
the asymptotical complexity. Algorithm 3 is cubic in the worst case and has complexity 
0(n^/logn) in the average case. 



Input: A - a. m X n matrix 

Input: P - a permutation vector of length m 

Input: Q - a permutation vector of length n 

Input: fe - an integer A: > 

Result: PLS decomposition of A 

begin 

r,ci — 0,0; 

for < i < n do Q[i] < — i; 
for < i < m do P[i] < — i; 
while r < m and c < n do 

if c + k > n then k i — n — c; 

k, dr < — PlsSubmatrix(A, r, c, k, P, Q); 

U i — the k X n submatrix starting at (r, 0) where every entry prior to the upper 
triangular matrix starting at (r, c) is zeroed out; 
if fc > then 

T,Li — MakeTable1((7, 0, c, fc); 

AddRowsFromTable(A, dr + 1, m, c, k, T, L); 

r, c •<— r + fe, c + fe; 
else 

// skip zero column 
c c+ 1; 
end 
end 

// Now compress L 

for < j < r do swap the columns j and Q[j] starting at row j; 
return r; 
end 

Algorithm 3: MMPF 



5 Asymptotically Fast PLS Decomposition 

It is well-known that PLUQ decomposition can be accomplished in-place and in time 
complexity ©(n") by reducing it to matrix-matrix multiplication (cf. [13]). We give a 
slight variation of the recursive algorithm from [13] in Algorithm 4. We compute the PLS 
instead of the PLUQ decomposition. 

In Algorithm 4 the routine SuBMATRlx(r;5, Cg, fe, Ce) returns a "view" (cf. [3]) into 
the matrix A starting at row and column and Cg resp. and ending at row and column 
Te and Ce resp. We note that that the step Aj^e < — ^mv ^ ^ne can be reduced 
to matrix-matrix multiplication (cf. [13]). Thus Algorithm 4 can be reduced to matrix- 
matrix multiplication and has complexity C(n"). Since no temporary matrices arc needed 
to perform the algorithm, except maybe in the matrix-matrix multiplication step, the 
algorithm is in-place. 

6 Implementation 

Similarly to matrix multiplication (cf. [3]) it is beneficial to call Algorithm 4 until some 
"cutoff" bound and to switch to a base-case implementation (in our case Algorithm 3) 



Input: A - a. m X n matrix 

Input: P - a permutation vector of length m 

Input: Q - a permutation vector of length n 

Result: PLS decomposition of A 

begin 

no < — pick some integer < no < n; // no « n/2 
^0 < — SuBMATRix(yl,0,0,m,no); 
Ai < — SubMatrix(^,0, no, m,n); 
Qo < — Q[0, ■ ■ • ,no]; 

T'o < — PLS{Ao, P,Qo); II first recursive call 

for < i < no do Q[i] <— Qo[i]; 

^ATw •« — SubMatrix(A, 0, 0, ro,ro); 

Asw < — SubMatrix(A, ro,0,m,ro); 

Ane < — SuBMATRix(^,0,no,ro,n); 

AsE < — SuBMATRix(j4,ro, no, m,n); 

if ri then 

// Compute of the Schur complement 

Aii — P X Ai; 

Lnw < — the lower left triangular matrix in Anw\ 

Ane < — ^mv ^ Ane', 
AsE i — AsE + Asw X Ane; 
end 

Pi < — P[ro, . . . ,m]; 
Qi < — Q[no, . . . ,n]; 

ri i — PLS{AsE, Pi,Qi); II second recursive call 

Asw < — P X Asw; 
II Update P & Q 

for < i < m - ro do P[ra + 1] = Pi [i] + ro; 
for < i < n — no do Q[no +i] <r- Qi[i] + no; 
j ^ ro; 

for no < i < no + ri do Qlj] Q[i]; j ^ j + 1; 
// How compress L 

j <- no; 

for ro < i < ro + ri do swap the columns i and j starting at row i; 
return ro + ri; 
end 

Algorithm 4: PLS Decomposition 



once this bound is reached. We perform the switch over if the matrix fits into 4MB or 
in L2 cache, whichever is smaller. These values seem to provide the best performance on 
our target platforms. 

The reason we are considering the PLS decomposition instead of either the LQUP or 

the PLUQ decomposition is that the PLS decomposition has several advantages over F2, 
in particular when the flat row-major representation is used to store entries. 

— We may choose where to cut with respect to columns in Algorithm 4. In particular, we 
may choose to cut along word boundaries. For LQUP decomposition, where roughly 
all steps are transposed, column cuts are determined by the rank rg. 

— In Algorithm 3 rows are added instead of columns. Row operations are much cheaper 
than column operations in row-major representation. 

— Column swaps do not occur in the main loop of either Algorithm 4 or 3, but only row 
swaps are performed. Column swaps are only performed at the end. Column swaps 
are much more expensive than row swaps (see below). 

— Fewer column swaps are performed for PLS decomposition than for PLUQ decompo- 
sition since U is not compressed. 

One of the major bottleneck arc column swaps. In Algorithm 5 a simple algorithm for 
swapping two columns a and h is given with bit-level detail. In Algorithm 5 we assume 
that the bit position of a is greater than the bit position of b for simplicity of presentation. 
The advantage of the strategy in Algorithm 5 is that it uses no conditional jumps in the 
inner loop. However, it still requires 9 instructions per row. On the other hand, we can 
add two rows with 9 • 128 = 1152 entries in 9 instructions if the SSE2 instruction set 
is available. Thus, for matrices of size 1152 x 1152 it takes roughly the same number of 
instructions to add two matrices as it does to swap two columns. If we were to swap 
every column with some other column once during some algorithm it thus would be as 
expensive as a matrix multiplication for matrices of these dimensions. 



Input: A - a.my.n matrix 
Input: o - an integer < o < 6 < n 
Input: h an integer Q < a < b < n 
Result: Swaps the columns a and 6 in A 
begin 

M < — the memory where A is stored; 
dm, hm i — the word index of a and 6 in M; 
ab , hb < — the bit index of a and b in Uw and bw ; 
A < — 06 — by, 

ttm < — the bit-mask where only the ojth bit is set to 1; 
b,n < — the bit-mask where only the 6(,th bit is set to 1; 
for < i < m do 

R < — the memory where the row i is stored; 
R[a^] < — R[a^] © 6m) » A); 

< — R[bn,] ((JiKl © aj) « A); 
R[aw] < — Rla-w] ((Ji[6^] © 6m) » A); 
end 
end 

Algorithm 5: Column Swap 



Another bottleneck for relatively sparse matrices in dense row-major representation is 
the search for pivots. Searching for a non-zero element in a row can be relatively expensive 
due to the need to identify the bit position. However, the main performance penalty is 
due to the fact that searching for a non-zero entry in one column is in a row-major 
representation is very cache unfriendly. 

Indeed, both our implementation and the implementation available in Magma suffer 
from performance regression on relatively sparse matrices as shown in Figure 2. We stress 
that this is despite the fact that the theoretical complexity of matrix decomposition is rank 
sensitive, that is, strictly less field operations have to be performed for low rank matrices. 
While the penalty for relatively sparse matrices is much smaller for our implementation 
than for MAGMA, it clearly does not achieve the theoretical possible performance. Thus, 
we also consider a hybrid algorithm which starts with M4RI and switches to PLS-based 
elimination as soon as the (approximated) density reaches 15%, denoted as 'M-l-P 0.15'. 




7 Results 

In Table 1 we give average running time over ten trials for computing reduced row echelon 
forms of dense random n x n matrices over F2. We compare the asymptotically fast im- 
plementation due to Allan Steel in Magma, the cubic Gaussian elimination implemented 



by Victor Shoup in NTL, and both our implementations. Both the implementation in 
Magma and our PLS decomposition reduce matrix decomposition to matrix multiplica- 
tion. A discussion and comparison of matrix multiplication in the M4RI library and in 
Magma can be found in [3]. In Table 1 the column 'PLS' denotes the complete running 
time for first computing the PLS decomposition and the computation of the reduced row 
echelon form from PLS. 





64-bit Linux, 2.6Ghz Opteron 


64-bit Linux, 2.33Giiz Xeon (E5345) 


n 


Magma 


NTL 


M4RI 


PLS 


Magma 


NTL 


M4RI 


PLS 




2.15-10 


5.4.2 


20090105 


20100324 


2.16-7 


5.4.2 


20100324 


20100324 


10,000 


3.351s 


18.45s 


2.430s 


1.452s 


2.660s 


12.05s 


1.360s 


0.864s 


16,384 


11.289s 


72.89s 


10.822s 


6.920s 


8.617s 


54.79s 


5.734s 


3.388s 


20, 000 


16.734s 


130.46s 


19.978s 


10.809s 


12.527s 


100.01s 


10.610s 


5.661s 


32,000 


57.567s 


479.07s 


83.575s 


49.487s 


41.770s 


382.52s 


43.042s 


20.967s 


64, 000 


373.906s 


2747.41s 


537.900s 


273.120s 


250.193s 




382.263s 


151.314s 



Table 1. RREF for random matrices 



In Table 2 we give running times for matrices as they appear when solving non-linear 
systems of equations. The matrices HFE 25, 30 and 35 were contributed by Michael 
Brickcnstcin and appear during a Grobncr basis computation of HFE systems using 
PolyBoRi. The Matrix MXL was contributed by Wael Said and appears during an 
execution of the MXL2 algorithm [15] for a random quadratic system of equations. We 
consider these matrices within the scope of this work since during matrix elimination the 
density quickly increases and because even the input matrices are dense enough such that 
we expect one non-zero element per 128-bit wide SSE2 XOR on average. The columns 
'M-l-P O.xx' denote the hybrid algorithms which start with M4RI and switch over to PLS 
based echelon form computation once the density of the remaining part of the matrix 
reaches 15% or 20% respectively. We note that the relative performance of the M4RI and 
the PLS algorithm for these instances depends on particular machine configuration. To 
demonstrate this we give a set of timings for the Intel Xeon X7460 machine sage. math** 
in Table 2. Here, PLS always is faster than M4RI, while on a Xeon E5345 M4RI wins for 
all HFE examples. We note that Magma is not available on the machine sage .math. The 
HFE examples show that the observed performance regression for sparse matrices does 
have an impact in practice and that the hybrid approach does look promising for these 
instances. 
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64-bit Fedora Linux, 2.33Ghz Xeon 


(E5345) 


Problem 


Matrix 


Density 


Magma 


M4RI 


PLS 


M+F 0.15 


M+P 0.20 




Dimension 




2.16-7 


20100324 


20100324 


20100429 


20100429 


HFE 25 


12,307 X 13,508 


0.076 


3.68s 


1.94s 


2.09s 


2.33s 


2.24s 


HFE 30 


19, 907 X 29, 323 


0.067 


23.39s 


11.46s 


13.34s 


12.60s 


13.00s 


HFE 35 


29, 969 X 55, 800 


0.059 




49.19s 


68.85s 


66.66s 


54.42s 


MXL 


26, 075 X 26, 407 


0.185 


55.15 


12.25s 


9.22s 


9.22s 


10.22s 








64-bit Ubuntu Linux, 2.66Ghz Xeon (X7460) 


Problem 


Matrix 


Density 




M4RI 


PLS 


M+P 0.15 


M-l-P 0.20 




Dimension 






20100324 


20100324 


20100429 


20100429 


HFE 25 


12,307 X 13,508 


0.076 




2.24s 


2.00s 


2.39s 


2.35s 


HFE 30 


19,907 X 29,323 


0.067 




27.52s 


13.29s 


13.78s 


22.9s 


HFE 35 


29, 969 X 55, 800 


0.059 




115.35s 


72.70s 


84.04s 


122.65s 


MXL 


26, 075 X 26, 407 


0.185 




26.61s 


8.73s 


8.75s 


13.23s 








64-bit Debian/GNU Linux, 2.6Ghz Opteron) 


Problem 


Matrix 


Density 


Magma 


M4RI 


PLS 


M+P 0.15 


M+P 0.20 




Dimension 




2.15-10 


20100324 


20100324 


20100429 


20100429 


HFE 25 


12,307 X 13,508 


0.076 


4.57s 


3.28s 


3.45s 


3.03s 


3.21s 


HFE 30 


19,907 X 29,323 


0.067 


33.21s 


23.72s 


25.42s 


23.84s 


25.09s 


HFE 35 


29, 969 X 55, 800 


0.059 


278.58s 


126.08s 


159.72s 


154.62s 


119.44s 


MXL 


26, 075 X 26, 407 


0.185 


76.81s 


23.03s 


19.04s 


17.91s 


18.00s 



Table 2. RREF for matrices from practice. 
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A Support Algorithms 



Input: A - a, m X n matrix 

Input: Tstart " an integer < rstart < m 

Input: Tend - an integer < rstart < J"end < m 

Input: Cstart - an integer < Cstart < n 

Input: A; - an integer fe > 

Input: T - a 2* X n matrix 

Input: L - an integer array of length 2*^ 

begin 

for rstart < i < Tend do 

irf = E'=0^[», Cstart +j]-2'=->-l; 

j i — L[id\; 

add row j from T to the row i of A starting at column Cstart; 
end 
end 

Algorithm 6: AddRowsFromTable 



Input: A - a m X n matrix 

Input: T'start - an integer < rstart < m 

Input: Cstart - an integer < Cstart < n 

Input: A; - an integer A; > 

Result: Retuns an 2*° x n matrix T 

begin 

T i — the 2* X n zero matrix; 
for 1 < i < 2*= do 

j < — the row index of A to add according to the Gray code; 
add row j of A to the row i of T starting at column Cstart; 
end 

L i — integer array allowing to index T by fc bits starting at column Cstart; 
return T, L; 
end 

Algorithm 7: MakeTable 



Input: A - a, rn X n matrix 

Input: r - an integer < r < m 

Input: c - an integer < c < n 

Input: fc - an integer k > 

Input: rend - an integer < r < Tend < rn 

Result: Returns the rank k < k and puts the fc x (n — c) submatrix starting at A[r, c] in 
reduced row echelon form. 

begin 

rs < — r; 

for c < j < c + k do 
found i — False; 
for rs <i < rend do 

for <l < j — c do // clear the first columns 

I if A[i,c + 1] ^ then add row r + Z to row i of A starting at column c + l; 
end 

if A[i,j] / then // pivot? 
Swap the rows i and rs in A; 
for r < I < rs do / / clear above 

I if j4[Z, ^ then add row rs to row I in A starting at column j; 
end 

rs < — rs + 1; 
found < — True; 
break; 
end 
end 

if found = False then 

I return j - c; 
end 
end 

return j - c; 
end 

Algorithm 8: GaussSubmatrix 



Input: A - a m X n matrix 

Input: P - a permutation vector of length m 

Input: Q - a permutation vector of length n 

Result: PLS decomposition of A. Returns the rank of A. 

begin 

r,c ^ 0,0; 

while r < m and c < n do 

found < — False; 

for c < j < n do // search for some pivot 
for r < i < m do 

I if j] then found <— True and break; 
end 

if found then break; 
end 

if found then 

P[r],Q[r]^i,j; 

swap the rows r and i in A; 

// clear below but preserve transformation matrix 
if J + 1 < n then 

for r + 1 < / < m do 
if A[l,j] then 

I add the row r to the row I starting at column j + 1; 
end 
end 
end 

r,c< — r + l,j + 1; 
else 

I break; 
end 
end 

for r < i < m do P[i] < — i ; 
for r < i < n do Q[i] < — i ; 
// Now compress L 

for < J < r do swap the columns j and Q[j] starting at row j; 
return r; 
end 

Algorithm 9: Gaussian PLS Decomposition 



Input: 


A 


- a m X n matrix 


Input: 




- an integer < Sr < m 


Input: 


Sc 


- an integer < Sc < n 


Input: 


k 


- an integer fe > 


Input: 


P 


- a permutation vector of length m 


Input: 


Q 


- a permutation vector of length n 



Result: Returns the rank k < k and dr - the last row considered. 

Also puts the k X {n — c) submatrix starting at {r, c) in PLS decomposition form. 

begin 

done i — all zero integer array of length k; 
for < r < k do 

found i — False; 

for Sr + r < i < in do // search for some pivot 
for < / < r do // clear before 
if done[l] < i then 

if A[i, Sc + /] / then 

I add row Sr + 1 io row i in A starting at column Sc + 1 + 1; 
end 

done[V\ < — i; 
end 
end 

if A[i, Sc + r\^Q then 

found i — True; 

break; 
end 
end 

if found = False then break; 
P[sr + r],Q[sr + r] i — i,Sc + r; 
swap the rows Sr + r and i in A; 
done[r\ i — i; 
end 

dr < — max({fione[i] | i G {0, . . . ,k — 1}}); 
for < C2 < fe and r + C2 < n — 1 do // finish submatrix 
for done[c2] < r2 < dr do 
if A[r2,r + C2] / then 

I add row r + C2 to row r2 in A starting at column r + C2 + 1; 
end 
end 
end 

return r, dr; 
end 

Algorithm 10: PlsSubmatrix 



